Recognizing Multi-Talker Speech with Permutation Invariant Training
Authors
Abstract
In this paper, we propose a novel technique for directly recognizing multiple speech streams from a single channel of mixed speech, without first separating them. Our technique is based on permutation invariant training (PIT) for automatic speech recognition (ASR). In PIT-ASR, we compute the average cross entropy (CE) over all frames in the whole utterance for each possible output-target assignment, pick the assignment with the minimum CE, and optimize for that assignment. PIT-ASR forces all the frames of the same speaker to be aligned with the same output layer. This strategy solves the label permutation problem and the speaker tracing problem in one shot. Our experiments on artificially mixed AMI data showed that the proposed approach is very promising.
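The assignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the nested-list data layout, and the toy posteriors are assumptions made for illustration, and a real system would compute this loss inside a neural network training loop.

```python
import itertools
import math


def avg_cross_entropy(posteriors, labels):
    """Average CE over all frames of one utterance for one output layer.

    posteriors[t][k] is the predicted probability of class k at frame t;
    labels[t] is the target class index at frame t.
    """
    total = -sum(math.log(frame[lab]) for frame, lab in zip(posteriors, labels))
    return total / len(labels)


def pit_loss(outputs, targets):
    """Try every output-target assignment and keep the one with minimum CE.

    outputs[s] holds the frame posteriors of output layer s;
    targets[j] holds the frame labels of speaker j.
    Returns (minimum loss, permutation), where permutation[s] is the
    speaker assigned to output layer s for the whole utterance.
    """
    best_loss, best_perm = math.inf, None
    for perm in itertools.permutations(range(len(targets))):
        # The assignment is fixed for the whole utterance, so each output
        # layer stays tied to a single speaker across all frames.
        loss = sum(
            avg_cross_entropy(outputs[s], targets[perm[s]])
            for s in range(len(outputs))
        ) / len(outputs)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example (hypothetical values): 2 output layers, 2 frames, 2 classes.
outputs = [
    [[0.9, 0.1], [0.8, 0.2]],  # output layer 0 favors class 0
    [[0.2, 0.8], [0.1, 0.9]],  # output layer 1 favors class 1
]
targets = [
    [1, 1],  # speaker 0 labels
    [0, 0],  # speaker 1 labels
]
loss, perm = pit_loss(outputs, targets)
print(perm, round(loss, 4))
```

In this toy case output layer 0 is matched to speaker 1 and output layer 1 to speaker 0, because that assignment gives the lower utterance-level CE. Enumerating all permutations costs a factorial number of evaluations in the number of speakers, which stays small for the two- or three-talker mixtures this kind of sketch targets.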
Similar resources
Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training
Although great progress has been made in automatic speech recognition (ASR), significant performance degradation is still observed when recognizing multi-talker mixed speech. In this paper, we propose and evaluate several architectures to address this problem under the assumption that only a single channel of the mixed signal is available. Our technique extends permutation invariant training (PI...
Recognizing spoken vowels in multi-talker babble: spectral and visual speech cues
It has been proposed that both spectral and visual speech cues assist in segregating a talker from noise. To test how these cues interact, the experiment examined identification of vowels (in an hVd context) presented in multi-talker babble. The availability of spectral cues was manipulated by filtering the signal into (1) 8 frequency amplitude-envelope bands or (2) the same bands with additional...
Differences in talker recognition by preschoolers and adults.
Talker variability in speech influences language processing from infancy through adulthood and is inextricably embedded in the very cues that identify speech sounds. Yet little is known about developmental changes in the processing of talker information. On one account, children have not yet learned to separate speech sound variability from talker-varying cues in speech, making them more sensit...
The effects of talker variability and variances on incidental learning of lexical tones
Over the past few decades, multi-talker variability has been found to be very effective in perception and production training of nonnative sound categories. The phonetic training paradigms mostly used explicit learning, in which learners received feedback on the categories when exposed to the training stimuli. More recently, studies have started to investigate how auditory categories are learne...
Speaker-Invariant Training via Adversarial Learning
We propose a novel adversarial multi-task learning scheme, aiming at actively curtailing the inter-talker feature variability while maximizing its senone discriminability so as to enhance the performance of a deep neural network (DNN) based ASR system. We call the scheme speaker-invariant training (SIT). In SIT, a DNN acoustic model and a speaker classifier network are jointly optimized to mini...
Journal title:
Volume, issue:
Pages: -
Publication date: 2017